Abstract. We present an efficient algorithm for calculating q-gram frequencies on strings represented
in compressed form, namely, as a straight line program (SLP). Given an SLP 𝒯 of size n that represents
string T, the algorithm computes the occurrence frequencies of all q-grams in T, by reducing the problem
to the weighted q-gram frequencies problem on a trie-like structure of size m = |T| - dup(q, 𝒯), where
dup(q, 𝒯) is a quantity that represents the amount of redundancy that the SLP captures with respect to
q-grams. The reduced problem can be solved in linear time. Since m = O(qn), the running time of our
algorithm is O(min{|T| - dup(q, 𝒯), qn}), improving our previous O(qn) algorithm when q = Ω(|T|/n).

1 Introduction 

Many large string data sets are usually first compressed and stored, while they are decompressed 
afterwards in order to be used and analyzed. Compressed string processing (CSP) is an approach 
that has been gaining attention in the string processing community. Assuming that the input is 
given in compressed form, the aim is to develop methods where the string is processed or ana- 
lyzed without explicitly decompressing the entire string, leading to algorithms with time and space 
complexities that depend on the compressed size rather than the whole uncompressed size. Since 
compression algorithms inherently capture regularities of the original string, clever CSP algorithms 
can be theoretically [12,4,10,7], and even practically [17,9], faster than algorithms which process
the uncompressed string. 

In this paper, we assume that the input string is represented as a Straight Line Program (SLP), 
which is a context free grammar in Chomsky normal form that derives a single string. SLPs are a 
useful tool when considering CSP algorithms, since it is known that outputs of various grammar 
based compression algorithms [15,14], as well as dictionary compression algorithms [22,20,21,19], can
be modeled efficiently by SLPs [16]. We consider the q-gram frequencies problem on compressed text
represented as SLPs. q-gram frequencies have profound applications in the field of string mining and
classification. The problem was first considered for the CSP setting in [11], where an O(|Σ|^2 n^2)-time,
O(n^2)-space algorithm for finding the most frequent 2-gram from an SLP of size n representing
text T over alphabet Σ was presented. In [3], it is claimed that the most frequent 2-gram can be
found in O(|Σ|^2 n log n) time and O(n log |T|) space, if the SLP is pre-processed and a self-index is
built. A much simpler and more efficient O(qn) time and space algorithm for general q ≥ 2 was recently
developed [9].

Remarkably, computational experiments on various data sets showed that the O(qn) algorithm
is actually faster than calculations on uncompressed strings, when q is small [9]. However, the
algorithm slows down considerably compared to the uncompressed approach when q increases.
This is because the algorithm reduces the q-gram frequencies problem on an SLP of size n, to the
weighted q-gram frequencies problem on a weighted string of size at most 2(q - 1)n. As q increases,
the length of this string becomes longer than the uncompressed string T. Theoretically q can be as
large as O(|T|), hence in such a case the algorithm requires O(|T|n) time, which is worse than a
trivial O(|T|) solution that first decompresses the given SLP and runs a linear time algorithm for
q-gram frequencies computation on T.



In this paper, we solve this problem, and improve the previous O(qn) algorithm both theoretically
and practically. We introduce a q-gram neighbor relation on SLP variables, in order to
reduce the redundancy in the partial decompression of the string which is performed in the previous
algorithm. Based on this idea, we are able to convert the problem to a weighted q-gram
frequencies problem on a weighted trie, whose size is at most |T| - dup(q, 𝒯). Here, dup(q, 𝒯) is a
quantity that represents the amount of redundancy that the SLP captures with respect to q-grams.
Since the size of the trie is also bounded by O(qn), the time complexity of our new algorithm
is O(min{qn, |T| - dup(q, 𝒯)}), improving on our previous O(qn) algorithm when q = Ω(|T|/n).
Preliminary computational experiments show that our new approach achieves a practical speed up
as well, for all values of q.



2 Preliminaries 



2.1 Intervals, Strings, and Occurrences 



For integers i ≤ j, let [i : j] denote the interval of integers {i, ..., j}. For an interval [i : j] and
integer q > 0, let pre([i : j], q) and suf([i : j], q) represent respectively, the length-q prefix and suffix
interval, that is, pre([i : j], q) = [i : min(i + q - 1, j)] and suf([i : j], q) = [max(i, j - q + 1) : j].

Let Σ be a finite alphabet. An element of Σ* is called a string. For any integer q > 0, an element
of Σ^q is called a q-gram. The length of a string T is denoted by |T|. The empty string ε is a string
of length 0, namely, |ε| = 0. For a string T = XYZ, X, Y and Z are called a prefix, substring, and
suffix of T, respectively. The i-th character of a string T is denoted by T[i], where 1 ≤ i ≤ |T|. For
a string T and interval [i : j] (1 ≤ i ≤ j ≤ |T|), let T([i : j]) denote the substring of T that begins
at position i and ends at position j. For convenience, let T([i : j]) = ε if j < i. For a string T and
integer q > 0, let pre(T, q) and suf(T, q) represent respectively, the length-q prefix and suffix of T,
that is, pre(T, q) = T(pre([1 : |T|], q)) and suf(T, q) = T(suf([1 : |T|], q)).

For any strings T and P, let Occ(T, P) be the set of occurrences of P in T, i.e., Occ(T, P) =
{k > 0 | T([k : k + |P| - 1]) = P}. The number of elements |Occ(T, P)| is called the occurrence
frequency of P in T.
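The interval and occurrence notation above can be made concrete with a small sketch. It is only an illustration: the function names pre, suf, substring and occ mirror the notation but are our own choice, and intervals are modelled as 1-based inclusive pairs (i, j).

```python
def pre(interval, q):
    """Length-q prefix interval of the closed interval [i : j]."""
    i, j = interval
    return (i, min(i + q - 1, j))


def suf(interval, q):
    """Length-q suffix interval of the closed interval [i : j]."""
    i, j = interval
    return (max(i, j - q + 1), j)


def substring(T, interval):
    """T([i : j]) with 1-based inclusive positions; empty string if j < i."""
    i, j = interval
    return T[i - 1:j] if i <= j else ""


def occ(T, P):
    """Occ(T, P): sorted list of 1-based start positions of P in T."""
    m = len(P)
    return [k for k in range(1, len(T) - m + 2) if T[k - 1:k - 1 + m] == P]


if __name__ == "__main__":
    T = "aababaababaab"
    print(substring(T, pre((1, len(T)), 3)))  # length-3 prefix of T
    print(substring(T, suf((1, len(T)), 3)))  # length-3 suffix of T
    print(occ(T, "aba"))                      # occurrences of the 3-gram "aba"
```

The occurrence frequency of P in T is then simply len(occ(T, P)).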



2.2 Straight Line Programs 



A straight line program (SLP) is a set of assignments 𝒯 = {X_1 → expr_1, X_2 → expr_2, ..., X_n → expr_n},
where each X_i is a variable and each expr_i is an expression, where expr_i = a (a ∈ Σ), or
expr_i = X_{ℓ(i)} X_{r(i)} (i > ℓ(i), r(i)). It is essentially a context free grammar in the Chomsky
normal form, that derives a single string. Let val(X_i) represent the string derived from variable X_i.
To ease notation, we sometimes associate val(X_i) with X_i and denote |val(X_i)| as |X_i|, and
val(X_i)([u : v]) as X_i([u : v]) for any interval [u : v]. An SLP 𝒯 represents the string T = val(X_n).
The size of the program 𝒯 is the number n of assignments in 𝒯. Note that |T| can be as large as O(2^n).
However, we assume as in various previous work on SLP, that the computer word size is at least log |T|,
and hence, values representing lengths and positions of T in our algorithms can be manipulated in
constant time.

Fig. 1. The derivation tree of SLP 𝒯 = {X_1 → a, X_2 → b, X_3 → X_1 X_2, X_4 → X_1 X_3,
X_5 → X_3 X_4, X_6 → X_4 X_5, X_7 → X_6 X_5}, representing string T = val(X_7) = aababaababaab.
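As an illustration of the definition, the SLP of Fig. 1 can be written down directly. The encoding below is an assumption made for this sketch (it is not prescribed by the paper): a Python list where entry i is either a single character (X_i → a) or a pair (ℓ(i), r(i)), with index 0 unused. The helper names lengths and val are likewise hypothetical.

```python
# SLP of Fig. 1; RULES[i] describes X_i, index 0 is unused.
RULES = [None, "a", "b", (1, 2), (1, 3), (3, 4), (4, 5), (6, 5)]


def lengths(rules):
    """|X_i| for all i, computed bottom-up (children have smaller indices)."""
    n = len(rules) - 1
    length = [0] * (n + 1)
    for i in range(1, n + 1):
        e = rules[i]
        length[i] = 1 if isinstance(e, str) else length[e[0]] + length[e[1]]
    return length


def val(rules, i):
    """Fully decompress val(X_i); takes time proportional to |X_i|."""
    out = []
    stack = [i]
    while stack:
        k = stack.pop()
        e = rules[k]
        if isinstance(e, str):
            out.append(e)
        else:
            stack.append(e[1])  # right child is expanded after the left one
            stack.append(e[0])
    return "".join(out)


if __name__ == "__main__":
    print(val(RULES, 7))       # the string T represented by the SLP
    print(lengths(RULES)[1:])  # |X_1|, ..., |X_7|
```

Note that lengths never decompresses anything, which is what makes length computations feasible even when |T| is exponential in n.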



The derivation tree of SLP 𝒯 is a labeled ordered binary tree where each internal node is labeled
with a non-terminal variable in {X_1, ..., X_n}, and each leaf is labeled with a terminal character
in Σ. The root node has label X_n. Let V denote the set of internal nodes in the derivation tree.
For any internal node v ∈ V, let ⟨v⟩ denote the index of its label X_⟨v⟩. Node v has a single child
which is a leaf labeled with c when (X_⟨v⟩ → c) ∈ 𝒯 for some c ∈ Σ, or v has a left-child and right-child
respectively denoted ℓ(v) and r(v), when (X_⟨v⟩ → X_⟨ℓ(v)⟩ X_⟨r(v)⟩) ∈ 𝒯. Each node v of the tree
derives val(X_⟨v⟩), a substring of T, whose corresponding interval itv(v), with T(itv(v)) = val(X_⟨v⟩),
can be defined recursively as follows. If v is the root node, then itv(v) = [1 : |T|]. Otherwise, if
(X_⟨v⟩ → X_⟨ℓ(v)⟩ X_⟨r(v)⟩) ∈ 𝒯, then itv(ℓ(v)) = [b_v : b_v + |X_⟨ℓ(v)⟩| - 1] and
itv(r(v)) = [b_v + |X_⟨ℓ(v)⟩| : e_v], where [b_v : e_v] = itv(v). Let vOcc(X_i) denote the number of
times a variable X_i occurs in the derivation tree, i.e., vOcc(X_i) = |{v | X_⟨v⟩ = X_i}|. We assume
that any variable X_i is used at least once, that is vOcc(X_i) > 0.

For any interval [b : e] of T (1 ≤ b ≤ e ≤ |T|), let ξ_𝒯(b, e) denote the deepest node v in the
derivation tree, which derives an interval containing [b : e], that is, itv(v) ⊇ [b : e], and no proper
descendant of v satisfies this condition. We say that node v stabs interval [b : e], and X_⟨v⟩ is called
the variable that stabs the interval. If b = e, we have that (X_⟨v⟩ → c) ∈ 𝒯 for some c ∈ Σ, and
itv(v) = b = e. If b < e, then we have (X_⟨v⟩ → X_⟨ℓ(v)⟩ X_⟨r(v)⟩) ∈ 𝒯, b ∈ itv(ℓ(v)), and e ∈ itv(r(v)).

When it is not confusing, we will sometimes use ξ_𝒯(b, e) to denote the variable X_⟨ξ_𝒯(b,e)⟩.

SLPs can be efficiently pre-processed to hold various information. |X_i| and vOcc(X_i) can be
computed for all variables X_i (1 ≤ i ≤ n) in a total of O(n) time by a simple dynamic programming
algorithm. Also, the following Lemma is useful for partial decompression of a prefix of a variable.

Lemma 1 ([8]). Given an SLP 𝒯 = {X_i → expr_i}_{i=1}^{n}, it is possible to pre-process 𝒯 in O(n) time
and space, so that for any variable X_i and 1 ≤ j ≤ |X_i|, X_i([1 : j]) can be computed in O(j) time.
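The dynamic programming mentioned above, together with a naive prefix decompression, might be sketched as follows. Two caveats: the list encoding of the SLP and all function names are assumptions of this sketch, and the prefix routine is a plain top-down traversal running in O(h + j) time for an SLP of height h. It is not the O(j)-time method of Lemma 1, which needs additional pre-processing.

```python
# SLP of Fig. 1; RULES[i] is a character or a pair (l(i), r(i)), index 0 unused.
RULES = [None, "a", "b", (1, 2), (1, 3), (3, 4), (4, 5), (6, 5)]


def lengths_and_vocc(rules):
    """Compute |X_i| and vOcc(X_i) for all variables in O(n) total time."""
    n = len(rules) - 1
    length = [0] * (n + 1)
    for i in range(1, n + 1):            # children have smaller indices
        e = rules[i]
        length[i] = 1 if isinstance(e, str) else length[e[0]] + length[e[1]]
    vocc = [0] * (n + 1)
    vocc[n] = 1                          # the root X_n occurs exactly once
    for i in range(n, 0, -1):            # parents have larger indices
        e = rules[i]
        if not isinstance(e, str):
            vocc[e[0]] += vocc[i]
            vocc[e[1]] += vocc[i]
    return length, vocc


def prefix(rules, i, j):
    """X_i([1 : j]) by a left-to-right traversal that stops after j characters."""
    out = []
    stack = [i]
    while stack and len(out) < j:
        e = rules[stack.pop()]
        if isinstance(e, str):
            out.append(e)
        else:
            stack.append(e[1])
            stack.append(e[0])
    return "".join(out)


if __name__ == "__main__":
    length, vocc = lengths_and_vocc(RULES)
    print(vocc[1:])              # vOcc(X_1), ..., vOcc(X_7)
    print(prefix(RULES, 7, 5))   # first five characters of T
```

For the SLP of Fig. 1, vOcc(X_1) and vOcc(X_2) equal the number of a's and b's in T, which gives a quick sanity check.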

The formal statement of the problem we solve is: 

Problem 1 (q-gram frequencies on SLP). Given integer q ≥ 1 and an SLP 𝒯 of size n that represents
string T, output (i, |Occ(T, P)|) for all P ∈ Σ^q where Occ(T, P) ≠ ∅, and some i ∈ Occ(T, P).

Since the problem is very simple for q = 1, we shall only consider the case for q ≥ 2 for the rest of
the paper. Note that although the number of distinct q-grams in T is bounded by O(qn), we would
require an extra multiplicative O(q) factor for the output if we output each q-gram explicitly as a
string. In our algorithms to follow, we compute a compact, O(qn)-size representation of the output,
from which each q-gram can be easily obtained in O(q) time.
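To fix the expected output format, here is a baseline sketch that solves Problem 1 on the uncompressed string. It assumes the text has already been decompressed and uses a hash map, so it is a reference for checking results rather than any of the algorithms discussed in this paper; the function name is hypothetical.

```python
from collections import Counter


def qgram_frequencies(T, q):
    """Map each q-gram P of T to (i, |Occ(T, P)|), where i is its first occurrence."""
    first = {}
    freq = Counter()
    for k in range(len(T) - q + 1):
        gram = T[k:k + q]
        first.setdefault(gram, k + 1)   # 1-based position of the first occurrence
        freq[gram] += 1
    return {gram: (first[gram], freq[gram]) for gram in freq}


if __name__ == "__main__":
    for gram, (i, count) in sorted(qgram_frequencies("aababaababaab", 3).items()):
        print(gram, i, count)
```

Reporting a position i instead of the q-gram itself is what keeps the output compact: the q-gram can be recovered from T([i : i + q - 1]) when needed.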

3 O(qn) Algorithm [9]

In this section, we briefly describe the O(qn) algorithm presented in [9]. The idea is to count
occurrences of q-grams with respect to the variable that stabs its occurrence. The algorithm reduces
Problem 1 to calculating the frequencies of all q-grams in a weighted set of strings, whose total
length is O(qn). Lemma 2 shows the key idea of the algorithm.

Lemma 2. For any SLP 𝒯 = {X_i → expr_i}_{i=1}^{n} that represents string T, integer q ≥ 2, and P ∈ Σ^q,
|Occ(T, P)| = ∑_{i=1}^{n} vOcc(X_i) · |Occ(t_i, P)|, where t_i = suf(val(X_{ℓ(i)}), q - 1) pre(val(X_{r(i)}), q - 1).

Proof. For any q ≥ 2, v stabs the interval [u : u + q - 1] if and only if [u : u + q - 1] ⊆ [s_v : f_v] =
suf(itv(ℓ(v)), q - 1) ∪ pre(itv(r(v)), q - 1). (See Fig. 2.) Also, since an occurrence of X_i in the
derivation tree always derives the same string val(X_i), t_i = T([s_v : f_v]) for any node v such that
X_⟨v⟩ = X_i. Therefore,

  |Occ(T, P)| = |{u > 0 | T([u : u + q - 1]) = P}|

    = ∑_{v ∈ V} |{u > 0 | ξ_𝒯(u, u + q - 1) = v, j = u - s_v + 1, t_⟨v⟩([j : j + q - 1]) = P}|

    = ∑_{i=1}^{n} ∑_{v ∈ V : X_⟨v⟩ = X_i} |{u > 0 | ξ_𝒯(u, u + q - 1) = v, j = u - s_v + 1, t_⟨v⟩([j : j + q - 1]) = P}|

    = ∑_{i=1}^{n} ∑_{v ∈ V : X_⟨v⟩ = X_i} |Occ(T([s_v : f_v]), P)| = ∑_{i=1}^{n} vOcc(X_i) · |Occ(t_i, P)|.

□

From Lemma 2, we have that occurrence frequencies in T are equivalent to occurrence frequencies in t_i
weighted by vOcc(X_i). Therefore, the q-gram frequencies problem can be regarded as obtaining the
weighted frequencies of all q-grams in the set of strings {t_1, ..., t_n}, where each occurrence of a
q-gram in t_i is weighted by vOcc(X_i). This can be further reduced to a weighted q-gram frequency
problem for a single string z, where each position of z holds a weight associated with the q-gram
that starts at that position. String z is constructed by concatenating all t_i's with length at least q.
The weights of positions corresponding to the first |t_i| - (q - 1) characters of t_i will be vOcc(X_i),
while the last (q - 1) positions will be 0, so that superfluous q-grams generated by the concatenation
are not counted. The remaining is a simple linear time algorithm using suffix and LCP arrays on the
weighted string, thus solving the problem in O(qn) time and space.
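The reduction of Lemma 2 can be sketched in a few lines. This is a simplified illustration rather than the implementation of [9]: it assumes the list encoding of the Fig. 1 SLP used here for illustration, computes the length-(q - 1) prefixes and suffixes of every variable bottom-up, and replaces the suffix and LCP arrays by a hash map, so counting costs O(q) per q-gram instead of O(1).

```python
from collections import Counter

# SLP of Fig. 1; RULES[i] is a character or a pair (l(i), r(i)), index 0 unused.
RULES = [None, "a", "b", (1, 2), (1, 3), (3, 4), (4, 5), (6, 5)]


def weighted_qgrams(rules, q):
    """q-gram frequencies of T as the sum over i of vOcc(X_i) * Occ(t_i, P), q >= 2."""
    n = len(rules) - 1
    pre = [""] * (n + 1)   # pre[i] = pre(val(X_i), q - 1)
    suf = [""] * (n + 1)   # suf[i] = suf(val(X_i), q - 1)
    for i in range(1, n + 1):
        e = rules[i]
        if isinstance(e, str):
            pre[i] = suf[i] = e
        else:
            l, r = e
            pre[i] = (pre[l] + pre[r])[:q - 1]
            suf[i] = (suf[l] + suf[r])[-(q - 1):]
    vocc = [0] * (n + 1)
    vocc[n] = 1
    for i in range(n, 0, -1):
        if not isinstance(rules[i], str):
            l, r = rules[i]
            vocc[l] += vocc[i]
            vocc[r] += vocc[i]
    freq = Counter()
    for i in range(1, n + 1):
        if isinstance(rules[i], str):
            continue
        l, r = rules[i]
        t = suf[l] + pre[r]                 # t_i, of length at most 2(q - 1)
        for k in range(len(t) - q + 1):     # empty when |t_i| < q
            freq[t[k:k + q]] += vocc[i]
    return freq


if __name__ == "__main__":
    print(sorted(weighted_qgrams(RULES, 3).items()))
```

Each t_i is built from at most 2(q - 1) characters, which is where the O(qn) bound on the total length comes from.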











Fig. 2. Length-q intervals [u : u + q - 1] where X_⟨ξ_𝒯(u, u+q-1)⟩ = X_i, and (X_i → X_{ℓ(i)} X_{r(i)}) ∈ 𝒯.



4 New Algorithm 

We now describe our new algorithm which solves the q-gram frequencies problem on SLPs. The new
algorithm basically follows the previous O(qn) algorithm, but is an elegant refinement. The reduction
for the previous O(qn) algorithm leads to a fairly large amount of redundantly decompressed
regions of the text as q increases. This is due to the fact that the t_i's are considered independently
for each variable X_i, while neighboring q-grams that are stabbed by different variables actually share
q - 1 characters. The key idea of our new algorithm is to exploit this redundancy. (See Fig. 3.)
In what follows, we introduce the concept of q-gram neighbors, and reduce the q-gram frequencies
problem on SLP to a weighted q-gram frequencies problem on a weighted tree.



4.1 q-gram Neighbor Graph 

We say that X_j is a right q-gram neighbor of X_i (i ≠ j), or equivalently, X_i is a left q-gram neighbor
of X_j, if for some integer u ∈ [1 : |T| - q], X_⟨ξ_𝒯(u, u+q-1)⟩ = X_i and X_⟨ξ_𝒯(u+1, u+q)⟩ = X_j. Notice
that |X_i| and |X_j| are both at least q if X_i and X_j are right or left q-gram neighbors of each other.




Fig. 3. q-gram neighbors and redundancies. (Left) X_j is a right q-gram neighbor of X_i, and X_i is a left q-gram
neighbor of X_j. Note that the right q-gram neighbor of X_i is uniquely determined since |X_{r(i)}| ≥ q and it must
be a descendant on the left most path rooted at X_{r(i)}. However, X_j may have other left q-gram neighbors, since
|X_{ℓ(j)}| < q, and they must be ancestors of X_j. t_i (resp. t_j) represents the string corresponding to the union of
intervals [u : u + q - 1] where X_⟨ξ_𝒯(u, u+q-1)⟩ = X_i (resp. X_⟨ξ_𝒯(u, u+q-1)⟩ = X_j). The shaded region depicts
the string which is redundantly decompressed, if both t_i and t_j are considered independently. (Right) Shows the
reverse case, when |X_{r(i)}| < q.



Definition 1. For q ≥ 2, the right q-gram neighbor graph of SLP 𝒯 = {X_i → expr_i}_{i=1}^{n} is the
directed graph G_q = (V, E_r), where

  V = {X_i | i ∈ {1, ..., n}, |X_i| ≥ q}

  E_r = {(X_i, X_j) | X_j is a right q-gram neighbor of X_i}

Note that there can be multiple right q-gram neighbors for a given variable. However, the total
number of edges in the neighbor graph is bounded by 2n, as will be shown below.

Lemma 3. Let X_j be a right q-gram neighbor of X_i. If |X_{r(i)}| ≥ q, then X_j is the label of the
deepest variable on the left-most path of the derivation tree rooted at a node labeled X_{r(i)} whose
length is at least q. Otherwise, if |X_{r(i)}| < q, then X_i is the label of the deepest variable on the
right-most path rooted at a node labeled X_{ℓ(j)} whose length is at least q.

Proof. Suppose |X_{r(i)}| ≥ q. Let u be a position, where X_⟨ξ_𝒯(u, u+q-1)⟩ = X_i and X_⟨ξ_𝒯(u+1, u+q)⟩ = X_j.
Then, since the interval [u + 1 : u + q] is a prefix of itv(X_{r(i)}), X_j must be on the left most path
rooted at X_{r(i)}. Since X_j = X_⟨ξ_𝒯(u+1, u+q)⟩, the lemma follows from the definition of ξ_𝒯. The case
for |X_{r(i)}| < q is symmetrical and can be shown similarly. □

Lemma 4. For an arbitrary SLP 𝒯 = {X_i → expr_i}_{i=1}^{n} and integer q ≥ 2, the number of edges in
the right q-gram neighbor graph G_q of 𝒯 is at most 2n.

Proof. Suppose X_j is a right q-gram neighbor of X_i. From Lemma 3, we have that if |X_{r(i)}| ≥ q, the
right q-gram neighbor of X_i is uniquely determined and that |X_{ℓ(j)}| < q. Similarly, if |X_{r(i)}| < q,
|X_{ℓ(j)}| ≥ q and the left q-gram neighbor of X_j is uniquely X_i. Therefore,

  ∑_{i=1}^{n} |{(X_i, X_j) ∈ E_r | |X_{r(i)}| ≥ q}| + ∑_{i=1}^{n} |{(X_i, X_j) ∈ E_r | |X_{r(i)}| < q}|

    = ∑_{i=1}^{n} |{(X_i, X_j) ∈ E_r | |X_{r(i)}| ≥ q}| + ∑_{j=1}^{n} |{(X_i, X_j) ∈ E_r | |X_{ℓ(j)}| ≥ q}| ≤ 2n.

□



Lemma 5. For an arbitrary SLP 𝒯 = {X_i → expr_i}_{i=1}^{n} and integer q ≥ 2, the right q-gram
neighbor graph G_q of 𝒯 can be constructed in O(n) time.

Proof. For any variable X_i, let lmq(X_i) and rmq(X_i) respectively represent the index of the label
of the deepest node with length at least q on the left-most and right-most path in the derivation
tree rooted at X_i, or null if |X_i| < q. These values can be computed for all variables in a total
of O(n) time based on the following recursion: If (X_i → a) ∈ 𝒯 for some a ∈ Σ, then lmq(X_i) =
rmq(X_i) = null. For (X_i → X_{ℓ(i)} X_{r(i)}) ∈ 𝒯,

  lmq(X_i) = null             if |X_i| < q,
             i                if |X_i| ≥ q and |X_{ℓ(i)}| < q,
             lmq(X_{ℓ(i)})    otherwise.

rmq(X_i) can be computed similarly. Finally,

  E_r = {(X_i, X_{lmq(X_{r(i)})}) | lmq(X_{r(i)}) ≠ null, i = 1, ..., n}
        ∪ {(X_{rmq(X_{ℓ(i)})}, X_i) | rmq(X_{ℓ(i)}) ≠ null, i = 1, ..., n}.

□
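The recursion in the proof of Lemma 5 translates almost line by line into code. The sketch below again assumes the list encoding of the Fig. 1 SLP (an illustrative choice, as are the function and variable names) and uses None in place of null.

```python
# SLP of Fig. 1; RULES[i] is a character or a pair (l(i), r(i)), index 0 unused.
RULES = [None, "a", "b", (1, 2), (1, 3), (3, 4), (4, 5), (6, 5)]


def neighbor_graph(rules, q):
    """Return (lmq, rmq, edges) of the right q-gram neighbor graph, for q >= 2."""
    n = len(rules) - 1
    length = [0] * (n + 1)
    lmq = [None] * (n + 1)
    rmq = [None] * (n + 1)
    for i in range(1, n + 1):
        e = rules[i]
        if isinstance(e, str):
            length[i] = 1                # lmq and rmq stay None since q >= 2
            continue
        l, r = e
        length[i] = length[l] + length[r]
        if length[i] >= q:
            lmq[i] = i if length[l] < q else lmq[l]
            rmq[i] = i if length[r] < q else rmq[r]
    edges = set()
    for i in range(1, n + 1):
        if isinstance(rules[i], str):
            continue
        l, r = rules[i]
        if lmq[r] is not None:           # unique right neighbor of X_i
            edges.add((i, lmq[r]))
        if rmq[l] is not None:           # unique left neighbor of X_i
            edges.add((rmq[l], i))
    return lmq, rmq, edges


if __name__ == "__main__":
    lmq, rmq, edges = neighbor_graph(RULES, 3)
    print(lmq[len(RULES) - 1])   # index of the variable stabbing the first q-gram
    print(sorted(edges))         # pairs (i, j): X_j is a right neighbor of X_i
```

Every variable contributes at most two edges in the second loop, which mirrors the 2n bound of Lemma 4.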



Lemma 6. Let G_q = (V, E_r) be the right q-gram neighbor graph of SLP 𝒯 = {X_i → expr_i}_{i=1}^{n}
representing string T, and let X_{i_1} = X_{lmq(X_n)}. Any variable X_j ∈ V (i_1 ≠ j) is reachable from
X_{i_1}, that is, there exists a directed path from X_{i_1} to X_j in G_q.

Proof. Straightforward, since any q-gram of T except for the left most T([1 : q]) has a q-gram on
its left. □



4.2 Weighted q-gram Frequencies Over a Trie 

From Lemma 6, we have that the right q-gram neighbor graph is connected. Consider an arbitrary
directed spanning tree rooted at X_{i_1} = X_{lmq(X_n)}, which can be obtained in linear time by a depth
first traversal on G_q from X_{i_1}. We define the label label(X_i) of each node X_i of the q-gram neighbor
graph, by

  label(X_i) = t_i[q : |t_i|]

where t_i = suf(val(X_{ℓ(i)}), q - 1) pre(val(X_{r(i)}), q - 1) as before. For convenience, let X_{i_0} be a
dummy variable such that label(X_{i_0}) = T([1 : q - 1]), and X_{r(i_0)} = X_{i_1} (and so (X_{i_0}, X_{i_1}) ∈ E_r).

Lemma 7. Fix a directed spanning tree on the right q-gram neighbor graph of SLP 𝒯, rooted at X_{i_0}.
Consider a directed path X_{i_0}, ..., X_{i_m} on the spanning tree. The weighted q-gram frequencies on the
string obtained by the concatenation label(X_{i_0}) label(X_{i_1}) ··· label(X_{i_m}), where each occurrence of a
q-gram that ends in a position in label(X_{i_j}) is weighted by vOcc(X_{i_j}), is equivalent to the weighted
q-gram frequencies of strings {t_{i_1}, ..., t_{i_m}} where each q-gram in t_{i_j} is weighted by vOcc(X_{i_j}).

Proof. Proof by induction: for m = 1, we have that label(X_{i_0}) label(X_{i_1}) = t_{i_1}. All q-grams in t_{i_1}
end in label(X_{i_1}) and so are weighted by vOcc(X_{i_1}). When label(X_{i_j}) is added to
label(X_{i_0}) ··· label(X_{i_{j-1}}), |label(X_{i_j})| new q-grams are formed, which correspond to q-grams in
t_{i_j}, i.e. |t_{i_j}| = q - 1 + |label(X_{i_j})|, and t_{i_j} is a suffix of label(X_{i_0}) ··· label(X_{i_j}). All the new
q-grams end in label(X_{i_j}) and are thus weighted by vOcc(X_{i_j}). □



Algorithm 1: Constructing weighted trie from SLP

 1 Construct right q-gram neighbor graph G = (V, E_r);
 2 Calculate vOcc(X_i) for i = 1, ..., n;
 3 Calculate |label(X_i)| for i = 1, ..., n;
 4 for i = 0, ..., n do visited[i] = false;
 5 X_{i_1} = X_{lmq(X_n)};
 6 Define X_{i_0} so that X_{r(i_0)} = X_{i_1} and |label(X_{i_0})| = q - 1;
 7 root ← new node; // root of resulting trie
 8 BuildDepthFirst(i_0, root);
 9 return root

Procedure BuildDepthFirst(i, trieNode)

   // add prefix of r(i) to trieNode while right neighbors of i are unique
 1 l ← 0; k ← i;
 2 while true do
 3     l ← l + |label(X_k)|;
 4     visited[k] ← true;
       // exit loop if right neighbor is possibly non-unique or is visited
 5     if |X_{r(k)}| < q or visited[lmq(X_{r(k)})] = true then break;
 6     k ← lmq(X_{r(k)});
 7 add new branch from trieNode with string X_{r(i)}([1 : l]);
 8 let end of new branch be newTrieNode;
   // If |X_{r(k)}| < q, there may be multiple right neighbors.
   // If |X_{r(k)}| ≥ q, nothing is done because it has already been visited.
 9 for X_c ∈ {X_j | (X_k, X_j) ∈ E_r} do
10     if visited[c] = false then
11         BuildDepthFirst(c, newTrieNode);



From Lemma 7, we can construct a weighted trie based on a directed spanning tree of G_q
and label(), where the weighted q-grams in the trie (represented as length-q paths) correspond to the
occurrence frequencies of q-grams in T.
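The correspondence between length-q paths in the trie and q-gram frequencies can be checked with a simplified sketch. It is not Algorithm 1: under the assumed list encoding of the Fig. 1 SLP, it computes every label(X_i) separately from length-(q - 1) prefixes and suffixes, walks an arbitrary spanning tree of the neighbor graph, and counts with a hash map instead of a suffix tree. All names are hypothetical.

```python
from collections import Counter, defaultdict

# SLP of Fig. 1; RULES[i] is a character or a pair (l(i), r(i)), index 0 unused.
RULES = [None, "a", "b", (1, 2), (1, 3), (3, 4), (4, 5), (6, 5)]


def trie_qgram_frequencies(rules, q):
    """Return (weighted q-gram counts, number of trie edges) for q >= 2."""
    n = len(rules) - 1
    length, vocc = [0] * (n + 1), [0] * (n + 1)
    pre, suf = [""] * (n + 1), [""] * (n + 1)
    lmq, rmq = [None] * (n + 1), [None] * (n + 1)
    for i in range(1, n + 1):
        e = rules[i]
        if isinstance(e, str):
            length[i], pre[i], suf[i] = 1, e, e
            continue
        l, r = e
        length[i] = length[l] + length[r]
        pre[i] = (pre[l] + pre[r])[:q - 1]
        suf[i] = (suf[l] + suf[r])[-(q - 1):]
        if length[i] >= q:
            lmq[i] = i if length[l] < q else lmq[l]
            rmq[i] = i if length[r] < q else rmq[r]
    if length[n] < q:
        return Counter(), 0
    vocc[n] = 1
    for i in range(n, 0, -1):
        if not isinstance(rules[i], str):
            vocc[rules[i][0]] += vocc[i]
            vocc[rules[i][1]] += vocc[i]
    succ, label = defaultdict(set), {}
    for i in range(1, n + 1):
        if isinstance(rules[i], str) or length[i] < q:
            continue
        l, r = rules[i]
        label[i] = (suf[l] + pre[r])[q - 1:]      # t_i[q : |t_i|]
        if lmq[r] is not None:
            succ[i].add(lmq[r])
        if rmq[l] is not None:
            succ[rmq[l]].add(i)
    freq, size = Counter(), q - 1                 # q - 1 edges for the dummy root label
    visited = {lmq[n]}
    stack = [(lmq[n], pre[n])]                    # (variable, last q - 1 path characters)
    while stack:
        i, context = stack.pop()
        for c in label[i]:
            context += c                          # q-gram ending inside label(X_i)
            freq[context] += vocc[i]
            context = context[1:]
        size += len(label[i])
        for j in succ[i]:
            if j not in visited:                  # each variable appears once in the tree
                visited.add(j)
                stack.append((j, context))
    return freq, size


if __name__ == "__main__":
    freq, size = trie_qgram_frequencies(RULES, 3)
    print(sorted(freq.items()), size)
```

Each variable contributes its label exactly once, however often it occurs in the derivation tree, which is the source of the savings over treating every t_i independently.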

Lemma 8. The trie can be constructed in time linear in its size.

Proof. See Algorithm 1. Let G be the q-gram neighbor graph. We construct the trie in a depth first
manner starting at X_{i_0}. The crux of the algorithm is that rather than computing label() separately
for each variable, we are able to aggregate the label()s and limit all partial decompressions of
variables to prefixes of variables, so that Lemma 1 can be used.

Any directed acyclic path on G starting at X_{i_0} can be segmented into multiple sequences of
variables, where each sequence X_{i_j}, ..., X_{i_k} is such that j is the only integer in [j : k] such that j = 0
or |X_{r(i_{j-1})}| < q. From Lemma 3, we have that X_{i_{j+1}}, ..., X_{i_k} are uniquely determined. If j > 0,
label(X_{i_j}) is a prefix of val(X_{r(i_j)}) since |X_{r(i_{j-1})}| < q (see Fig. 3 Right), and if j = 0, label(X_{i_0})
is again a prefix of val(X_{r(i_0)}) = val(X_{i_1}). It is not difficult to see that label(X_{i_j}) ··· label(X_{i_k}) is
also a prefix of val(X_{r(i_j)}) since X_{i_{j+1}}, ..., X_{i_k} are all descendants of X_{r(i_j)}, and each label() extends the
partially decompressed string to consider consecutive q-grams in X_{r(i_j)}. Since prefixes of variables
of SLPs can be decompressed in time proportional to the output size with linear time pre-processing
(Lemma 1), the lemma follows. □



A minor technicality is that a node in the trie may have multiple children with the same character label, but this does
not affect the time complexities of the algorithm.



We only illustrate how the character labels are determined in the pseudo-code of Algorithm 1.
It is straightforward to assign a weight vOcc(X_k) to each node of the trie that corresponds to label(X_k).

Lemma 9. The number of edges in the trie is (q - 1) + ∑{|t_i| - (q - 1) | |X_i| ≥ q, i = 1, ..., n} =
|T| - dup(q, 𝒯) where

  dup(q, 𝒯) = ∑{(vOcc(X_i) - 1) · (|t_i| - (q - 1)) | |X_i| ≥ q, i = 1, ..., n}

Proof. (q - 1) + ∑{|t_i| - (q - 1) | |X_i| ≥ q, i = 1, ..., n} is straightforward from the definition of
label(X_i) and the construction of the trie. Concerning dup, each variable X_i occurs vOcc(X_i) times in the
derivation tree, but only once in the directed spanning tree. This means that for each occurrence
after the first, the size of the trie is reduced by |label(X_i)| = |t_i| - (q - 1) compared to |T|. Therefore, the
lemma follows. □
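Both sides of Lemma 9 only depend on the lengths |X_i| and the counts vOcc(X_i), so they can be evaluated without decompressing anything. The following sketch does so for the Fig. 1 SLP under the assumed list encoding; it also reports the total length of the t_i's with |t_i| ≥ q, the quantity the previous algorithm works on. Function and variable names are our own.

```python
# SLP of Fig. 1; RULES[i] is a character or a pair (l(i), r(i)), index 0 unused.
RULES = [None, "a", "b", (1, 2), (1, 3), (3, 4), (4, 5), (6, 5)]


def trie_size_and_dup(rules, q):
    """Return (|T|, number of trie edges, dup(q, SLP), total length of the t_i)."""
    n = len(rules) - 1
    length = [0] * (n + 1)
    for i in range(1, n + 1):
        e = rules[i]
        length[i] = 1 if isinstance(e, str) else length[e[0]] + length[e[1]]
    vocc = [0] * (n + 1)
    vocc[n] = 1
    for i in range(n, 0, -1):
        if not isinstance(rules[i], str):
            vocc[rules[i][0]] += vocc[i]
            vocc[rules[i][1]] += vocc[i]
    trie_edges, dup, total_t = q - 1, 0, 0
    for i in range(1, n + 1):
        if isinstance(rules[i], str) or length[i] < q:
            continue
        l, r = rules[i]
        t_len = min(length[l], q - 1) + min(length[r], q - 1)   # |t_i|
        label_len = t_len - (q - 1)                             # |label(X_i)|
        trie_edges += label_len
        dup += (vocc[i] - 1) * label_len
        total_t += t_len
    return length[n], trie_edges, dup, total_t


if __name__ == "__main__":
    for q in range(2, 7):
        text_len, trie_edges, dup, total_t = trie_size_and_dup(RULES, q)
        print(q, text_len, trie_edges, dup, total_t)
```

For q = 3 on this example, the trie has 9 edges while the t_i's total 15 characters and |T| = 13, so the identity reads 9 = 13 - 4.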

To efficiently count the weighted q-gram frequencies on the trie, we can use suffix trees. A suffix tree
for a trie is defined as a generalized suffix tree for the set of strings represented in the trie as leaf
to root paths. The following is known.

Lemma 10 ([18]). Given a trie of size m, the suffix tree for the trie can be constructed in O(m)
time and space.

With a suffix tree, it is a simple exercise to solve the weighted q-gram frequencies problem on
the trie in linear time. In fact, it is known that the suffix array for the common suffix trie can also be
constructed in linear time [6], as well as its longest common prefix array [13], which can also be
used to solve the problem in linear time.

Corollary 1. The weighted q-gram frequencies problem on a trie of size m can be solved in O(m)
time and space.

From the above arguments, the theorem follows.

Theorem 1. The q-gram frequencies problem on an SLP 𝒯 of size n, representing string T can be
solved in O(min{qn, |T| - dup(q, 𝒯)}) time and space.

Note that since each q ≤ |t_i| ≤ 2(q - 1), and |label(X_i)| = |t_i| - (q - 1), the total length of
decompressions made by the algorithm, i.e. the size of the reduced problem, is at least halved and
can be as small as 1/q (when all |t_i| = q, for example, in an SLP that represents LZ78 compression),
compared to the previous O(qn) algorithm.

5 Preliminary Experiments 

We first evaluate the size of the trie induced from the right q-gram neighbor graph, on which
the running time of the new algorithm of Section 4 is dependent. We used data sets obtained from
Pizza & Chili Corpus, and constructed SLPs using the RE-PAIR [14] compression algorithm. Each
data is of size 200MB. Table 1 shows the sizes of the trie for different values of q, in comparison with
the total length of strings t_i, on which the previous O(qn)-time algorithm of Section 3 works. We
cumulated the lengths of all t_i's only for those satisfying |t_i| ≥ q, since no q-gram can occur in t_i's
with |t_i| < q. Observe that for all values of q and for all data sets, the size of the trie (i.e., the total
number of characters in the trie) is smaller than those of t_i's and the original string.



^ When considering leaf to root paths on the trie, the direction of the string is the reverse of what is in T. However, this
is merely a matter of representation of the output.



Table 1. A comparison of the size of the trie and the total length of strings t_i for SLPs that represent textual data from
Pizza & Chili Corpus. The length of the original text is 209,715,200. The SLPs were constructed by RE-PAIR [14].

               XML                  (name missing)              ENGLISH                   PROTEINS
   q       ∑|t_i|    trie size       ∑|t_i|    trie size       ∑|t_i|    trie size       ∑|t_i|    trie size
   2   19,082,988    9,541,495   46,342,894   23,171,448   37,889,802   18,944,902   64,751,926   32,375,964
   3   37,966,315   18,889,991   92,684,656   46,341,894   75,611,002   37,728,884  129,449,835   64,698,833
   4   55,983,397   27,443,734  139,011,475   69,497,812  112,835,471   56,066,348  191,045,216   93,940,205
   5   72,878,965   35,108,101  185,200,662   92,516,690  148,938,576   73,434,080  243,692,809  114,655,697
   6   88,786,480   42,095,985  230,769,162  114,916,322  183,493,406   89,491,371  280,408,504  123,786,699
   7  103,862,589   48,533,013  274,845,524  135,829,862  215,975,218  103,840,108  301,810,933  127,510,939
   8  118,214,023   54,500,142  315,811,932  153,659,844  246,127,485  116,339,295  311,863,817  129,618,754
   9  131,868,777   60,045,009  352,780,338  167,598,570  273,622,444  126,884,532  318,432,611  131,240,299
  10  144,946,389   65,201,880  385,636,192  177,808,192  298,303,942  135,549,310  325,028,658  132,658,662
  15  204,193,702   86,915,492  477,568,585  196,448,347  379,441,314  157,558,436  347,993,213  138,182,717
  20  255,371,699  104,476,074  497,607,690  200,561,823  409,295,884  162,738,812  364,230,234  142,213,239
  50  424,505,759  157,069,100  530,329,749  206,796,322  429,380,290  165,882,006  416,966,397  156,257,977
 100  537,677,786  192,816,929  536,349,226  207,838,417  435,843,895  167,313,028  463,766,667  168,544,608


The construction of the suffix tree or array for a trie, as well as the algorithm for Lemma 1,
require various tools such as level ancestor queries [5,2,1] for which we did not have an efficient
implementation. Therefore, we try to assess the practical impact of the reduced problem size using
a simplified version of our new algorithm. We compared three algorithms (NSA, SSA, STSA) that
count the occurrence frequencies of all q-grams in a text given as an SLP. NSA is the O(|T|)-time
algorithm which works on the uncompressed text, using suffix and LCP arrays. SSA is our previous
O(qn)-time algorithm [9], and STSA is a simplified version of our new algorithm. STSA further
reduces the weighted q-gram frequencies problem on the trie, to a weighted q-gram frequencies problem on
a single string as follows: instead of constructing the trie, each branch of the trie (on line 7 of BuildDepthFirst)
is appended into a single string. The q-grams that are represented in the branching edges of the trie can
be represented in the single string, by redundantly adding suf(X_{r(i)}([1 : l]), q - 1) in front of the
string corresponding to the next branch. This leads to some duplicate partial decompression, but
the resulting string is still always shorter than the string produced by our previous algorithm [9].
The partial decompression of X_{r(i)}([1 : l]) is implemented using a simple O(h + l) algorithm, where
h is the height of the SLP which can be as large as O(n).

All computations were conducted on a Mac Pro (Mid 2010) with MacOS X Lion 10.7.2, and 2 x
2.93GHz 6-Core Xeon processors and 64GB Memory, only utilizing a single process/thread at once.
The program was compiled using the GNU C++ compiler (g++) 4.6.2 with the -Ofast option for
optimization. The running times were measured in seconds, after reading the uncompressed text
into memory for NSA, and after reading the SLP that represents the text into memory for SSA
and STSA. Each computation was repeated at least 3 times, and the average was taken.

Table 2 summarizes the running times of the three algorithms. SSA and STSA computed
weighted q-gram frequencies on t_i and the trie, respectively. Since the difference between the total length
of t_i and the size of the trie becomes larger as q increases, STSA outperforms SSA when the value of q is
not small. In fact, in Table 2, STSA was faster than SSA for all values of q > 3. STSA was even faster
than NSA on the XML data whenever q < 20. What is interesting is that STSA outperformed NSA
on the ENGLISH data when q = 100.